arXiv: 1503.07157v2 [math.NA] 17Jun2015 


A RANDOMIZED BLOCKED ALGORITHM FOR EFFICIENTLY COMPUTING 
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Abstract: This manuscript describes a technique for computing partial rank- 
revealing factorizations, such as, e.g, a partial QR factorization or a partial 
singular value decomposition. The method takes as input a tolerance e and 
an m X n matrix A, and returns an approximate low rank factorization of A 
that is accurate to within precision e in the Frobenius norm (or some other 
easily computed norm). The rank k of the computed factorization (which 
is an output of the algorithm) is in all examples we examined very close to 
the theoretically optimal e-rank. The proposed method is inspired by the 
Gram-Schmidt algorithm, and has the same 0(mnk ) asymptotic flop count. 

However, the method relies on randomized sampling to avoid column pivoting, 
which allows it to be blocked, and hence accelerates practical computations 
by reducing communication. Numerical experiments demonstrate that the 
accuracy of the scheme is for every matrix that was tried at least as good as 
column-pivoted QR, and is sometimes much better. Computational speed is 
also improved substantially, in particular on GPU architectures. 

1. Introduction 

1.1. Problem formulation. This manuscript describes an algorithm based on randomized sam¬ 
pling for computing an approximate low-rank factorization of a given matrix. To be precise, given 
a real or complex matrix A of size m x n, and a computational tolerance e, we seek to determine 
a matrix A ap p rox of low rank such that 

(1) 11 A A a pp r0 x || Y E. 

For any given k G {1, 2, ..., min(m,n)}, a rank-A: approximation to A that is in many ways 
optimal is given by the partial singular value decomposition (SVD), 

( 2 ) = 

v ' m x n mxkkxkkxn 

where U/t and V& are orthonormal matrices whose columns consist of the first k left and right 
singular vectors, respectively, and is a diagonal matrix whose diagonal entries {crj}j =1 are the 
leading k singular values of A, ordered so that a\ > 02 > 03 > • • • > ak > 0. The Eckart-Young 
theorem [3] states that for the spectral norm and the Frobenius norm, the residual error is minimal, 

||A — Afc|| = inf{|| A — C|| : C has rank k}. 

However, computing the factors in ([ 2 ]) is computationally expensive. In contrast, our objective is 
to find an approximant A a p prox that is cheap to compute, and close to optimal. 

The method we present is designed for situations where A is sufficiently large that computing 
the full SVD is not economical. The method is designed to be highly communication efficient, 
and to execute efficiently on both shared and distributed memory machines. It has been tested 
numerically for situations where the matrix fits in RAM on a single machine. We will, without loss 
of generality, assume that m > n. For the most part we discuss real matrices, but the generalization 
to complex matrices is straight-forward. 
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1.2. A greedy template. A standard approach in computing low-rank factorizations is to employ 
a greedy algorithm to build, one vector at a time, an orthonormal basis {q j}j=i that approximately 
spans the columns of A. To be precise, given an m x n matrix A and a computational tolerance 
e, our objective is to determine a rank k, and an m x k matrix Q& = [q 1 • • • q fc ] with orthonormal 
column vectors such that II A - QfcB fc || < e, where = Q£A. The matrices Q*. and B^. may be 
constructed jointly via the following procedure: 


Algorithm 1 



(1) Qo = [ ]; B 0 = [ ]; 

(2) while A^3 > £ 

(3) 3=3 + 1 

o 

II 

< 

II 

o 

< 

(4) Pick a 

(5) b j = q 

unit vector q - e ran(A^ b). 

II II 

O CO 

Qi-i % 

B j-i 

b J 

] 

(8) = 

(9) end while 

(10) k=j. 

Atf-b 

- 


Note that A ( b can overwrite A^ 11 . It can be verifiedthat if the algorithm is executed in exact 
arithmetic, then the matrices generated satisfy 

(3) A^ = A - QjQ*A, and Bj = Q*A. 

The performance of the greedy scheme is determined by how we choose the vector q y on line 

(4) . If we pick q^ as simply the largest column of A*'"'” 1 ), scaled to yield a vector of unit length, 
then we recognize the scheme as the column pivoted Gram-Schmidt algorithm for computing a 
QR factorization. This method often works very well, but can lead to sub-optimal factorizations. 
Reference [5] discusses this in detail, and also provides an improved pivoting technique that can 
be proved to yield closer to optimal results. However, both standard Gram-Schmidt (see, e.g., [3J 
Sect. 5.2]), and the improved version in [5] are challenging to implement efficiently on modern 
multicore processors since they cannot readily be blocked. Expressed differently, they rely on 
BLAS2 operations rather than BLAS3. 

Another natural choice for q y on line (4) is to pick the unit vector that minimizes ||A^'~b — 
qq*A (j - 1) ||. This in fact leads to an optimal factorization, with the vectors {q j}j=\ being left 
singular vectors of A. However, finding the minimizer tends to be computationally expensive. 

In this manuscript, we propose a scheme that is more computationally efficient than column- 
pivoted Gram-Schmidt, and often yields close to minimal approximation errors. The idea is to 
choose q ■ as a random linear combination of the columns of A ( - , ~ 1 *. To be precise, we propose the 
following mechanism for choosing q ; : 


(4a) Draw a random vector u whose entries are iid Gaussian random variables. 
(4b) Set y = A^ -1 ^ 

(4c) Normalize so that q ; = jAy y. 
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This scheme is mathematically very close to the low-rank approximation scheme proposed in [6], 
but is slightly different in the stopping criterion used (the scheme of |6] does not explicitly update 
the matrix, and therefore relies on a probabilistic stopping criterion), and in its performance when 
executed with finite precision arithmetic. We argue that choosing the vector q ; using randomized 
sampling leads to performance very comparable to traditional column pivoting, but has a deci¬ 
sive advantage in that the resulting algorithm is easy to block. We will demonstrate substantial 
practical speed-up on both multicore CPUs and GPUs. 


Remark 1 . The factorization scheme described in this section produces an approximate factoriza¬ 
tion of the form A ~ where Q ^ is orthonormal, but no conditions are a priori imposed on 

B/,,. Once the factors Q& and B& are available, it is simple to compute many standard factorizations 
such as the low rank QR, SVD, or CUR factorizations. For details, see Section\T3. 


2. Technical preliminaries 


2.1. Notation. Throughout the paper, we measure vectors in M n using their Euclidean norm. 

/ \ V 2 

The default norm for matrices will be the Frobenius norm ||A|| = ( |A(i,j)| 2 ) , although 

other norms will also be discussed. 


We use the notation of Golub and Van Loan [I] to specify submatrices. In other words, if B is 

Ji, J 2 , • • ■, je] are two index 


an m x n matrix with entries bij, and I = [i 1 , i-i, 
vectors, then we let B(J, J) denote the k x i matrix 

, ik\ and J 

B(/,J) = 

bilji 

bi 2 ji 

biij 2 

bi 2 j 2 

biije 

bi 2 j e 


bikjl 

bikj 2 

bikje 


We let B(J, :) denote the matrix B(I, [1,2,..., n]), and define B(:, J) analogously. 


The transpose of B is denoted B*, and we say that a matrix U is orthonormal if its columns form 
an orthonormal set, so that U*U = /. 


2.2. The singular value decomposition (SVD). The SVD was introduced briefly in the in¬ 
troduction. Here we define it again, with some more detail added. Let A denote an m x n matrix, 
and set r = min(m, n). Then A admits a factorization 

A = U Z V*, 

m x n m x r r x r r x n 


where the matrices U and V are orthonormal, and Z is diagonal. We let {u ;}[ =1 and {vi }[ =1 
denote the columns of U and V, respectively. These vectors are the left and right singular vectors 
of A. As in the introduction, the diagonal elements {cxy}y =1 of Z are the singular values of A. We 
order these so that (T\ > <72 > • • ■ > ay > 0. We let A& denote the truncation of the SVD to its 
first k terms, as defined by ©• It is easily verified that 

( min(m,n) \ 

Y of , 
j=k+l J 

where 11 A 11 spec t ra i denotes the operator norm of A and ||A|| denotes the Frobenius norm of A. 
Moreover, the Eckart-Young theorem [3j states that these errors are the smallest possible errors 
that can be incurred when approximating A by a matrix of rank k. 
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2.3. 

( 6 ) 


The QR factorization. Any m x n matrix A admits a QR factorization of the form 

A P = Q R, 

m x n n x n m x r r x n 


where r = min(m,n), Q is orthonormal, R is upper triangular, and P is a permutation matrix. 
The permutation matrix P can more efficiently be represented via a vector J c e Z” of column 
indices such that P = l(:, J c ) where I is the n x n identity matrix. Then ([B]) can be written 

fij\ A(: , J c ) = Q R, 

' ' m x n m x r r x n 


The QR-factorization is often computed via column pivoting combined with either the Gram- 
Schmidt process, Householder reflectors, or Givens rotations [3]. The resulting upper triangular 
R then satisfies various decay conditions [5]. These techniques are all incremental, and can be 
stopped after the first k terms have been computed to obtain a “partial QR-factorization of A"’: 


( 8 ) 


A (:,J C ) ~ Qfc Rfc- 

m x n m x r r x n 


The main drawback of the classical partial pivoted QR approximation is the difficulty to obtain 
substantial speedups on multi-processor architectures. 


2.4. Orthonormalization. Given an m x £ matrix X, with m > £, we introduce the function 

Q = orth(X) 

to denote orthonormalization of the columns of X. In other words, Q will be an mxi orthonormal 
matrix whose columns form a basis for the column space of X. In practice, this step is typically 
achieved most efficiently by a call to a packaged QR factorization (e.g., in Matlab, we would write 
Q = qr(X, 0)). This step could in principle be implemented without pivoting, which makes this 
call efficient. 


3. Construction of low-rank approximations via randomized sampling 


3.1. A basic randomized scheme. Let A be a given mxn matrix whose singular values exhibit 
some decay, and suppose that we seek a matrix Q with £ orthonormal columns such that 

(9) A ~ Q B, where B = Q* A. 

In other words, we seek a matrix Q whose columns form an approximate orthonormal basis for 
the column space of A. A randomized procedure for solving this task was proposed in pr|, and 
later analyzed an elaborated in M- A basic version of the scheme that we call “randQB” is given 
in Figure [lj Once randQB has been executed to produce the factors Q and B in Q, standard 
factorizations such as the QR factorization, or the truncated SVD can easily be obtained, as 
described in Section [3731 


3.2. Over sampling and theoretical performance guarantees. The algorithm randQB de¬ 
scribed in Section 3H produces close to optimal results for matrices whose singular values decay 
rapidly, provided that some slight over-sampling is done. To be precise, if we seek to match the 
minimal error for a factorization of rank k, then choose £ in randQB as 

£ = k + s 

where s is a small integer (say s = 10). It was shown in [6j Thm. 10.5] that if s > 2, then 


E[||A-QB||] < 1 + 


min (m,n) 



1/2 


CTa 
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function [Q, B] = randQB(A,f) 


(1) 

Q = randn(n, l) 


(2) 

Q = orth(A£2) 

Cm m mnl + Cqr ml 2 

(3) 

B = Q*A 

Cm m mnl 


Figure 1. The most basic version of the randomized range finder. The algorithm 
takes as input an m x n matrix A and a target rank l, and produces factors Q and 
B of sizes m x l and £ x n, respectively, such that A ~ QB. Text in blue refers to 
computational cost, see Section |4~4] for notation. 


where E denotes expectation. Recall from equation (5) that I Y2j= 


,min(m,n) 2~\ 


cr 


is the theoretically 


j]=k+l 

minimal error in approximating A by a matrix of rant k, so we miss the optimal bound only by a 
factor of ^1 T (except for the over sampling, of course). Moreover, it can be shown that the 

likelihood of a substantial deviation from the expectation is extremely small [6] Sec. 10.3]. 

Remark 2. When errors are measured in the spectral norm, as opposed to the Frobenius norm, 
the randomized scheme is slightly further removed from optimality. Theorem 10.6 of [6] states that 

1/2 


( 10 ) 


E[||A — QB|| S pectral] < ( 1 + 


k 


s — 1 


&k+i + 


z\Jk + . 


' min(m,n) 

£ 

j=k +1 


<J ~ 


where e = 2.718 ■■■ is the basis of the natural exponent. We observe that in cases where the singular 
values decay slowly, the right hand side of © is substantially larger than the theoretically optimal 
value of ( Tfe+i- For such situation, the u power scheme” described in Section 5.1 should be used. 


3.3. Computing standard factorizations. The output of the randomized factorization scheme 
in Figure [l] is a factorization A ~ QB where Q is orthonormal, but no constraints have been placed 
on B. It turns out that standard factorizations can efficiently be computed from the factors Q and 
B; in this section we describe how to get the QR, the SVD, and “interpolatory” factorizations. 


3.3.1. Computing the low rank SVD. To get a low rank SVD, cf. Section 2.2, we perform the full 
SVD on the lx n matrix B, to obtain a factorization B = U DV. Then, 


A ss QB = QUDV*. 


We can now choose a rank k to use based on the decaying singular values of D. Once a suitable 
rank has been chosen, we form the low rank SVD factors: 

Ufc = QU(:, 1 : k), Z fc = D(l :k,l:k), and V fe = V(:, 1 : k), 

so that A ~ UfcZfcV)!. Observe that the truncation undoes the over-sampling that was done and 
detects a numerical rank k that is typically very close to the optimal e-rank. 


3.3.2. Computing the partial pivoted QR factorization. To obtain the factorization AP ~ QR, 


cf. Section 2.3, from the QB decomposition, perform a QR factorization of the lx n matrix B to 
obtain BP = QR. Then, set Q = QQ to obtain 


AP « QBP = QQR = QR. 
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3.3.3. Computing interpolatory and CUR factorizations. In applications such as data interpreta¬ 
tion, it is often of interest to determine a subset of the rows/columns of A that form a good basis 
for its row/column space. For concreteness, suppose that A is an m x n matrix of rank k, and that 
we seek to determine an index set J of length k, and a matrix Y of size k x n such that 

A « A (:,J) Y. 
m x n m x k k x n 



One can prove that there always exist such a factorization for which every entry of Y is bounded 
in modulus by 1 (which is to say that the columns in A(:, J) form a well-conditioned basis for 
the range of A) and for which Y (:, J) is the k x k identity matrix [2]- Now suppose that we have 
available a factorization A = QB where B is of size £ X n. Then determine J and Y such that 


( 12 ) 


B « B (:, J) Y. 

£ x n lx k k x n 


This can be done using the techniques in, e.g., [2] or [5J. Then (11) holds automatically for the 
index set J and the matrix Y that were constructed. Using similar ideas, one can determine a set of 
rows that form a well-conditioned basis for the row space, and also the so called CUR factorization 


A « C UR*, 

m x n m x k k x k k x n 


where C and R consist of subsets of the columns and rows of A, respectively, cf. [12]. 


4. A BLOCKED VERSION OF THE RANDOMIZED RANGE FINDER 


In this section, we describe and analyze a blocked version of the basic randomized scheme described 
in Figure [lj By blocking, we improve computational efficiency and simplify implementation on 
parallel machines. Blocking also allows for adaptive rank determination to be incorporated for sit¬ 
uations where the rank is not known in advance. While blocking greatly helps with computational 
efficiency, it also creates some issues in terms of aggregation of round-off errors; this problem can 


be eliminated using techniques described in Section 4.3 


The algorithm described in this section is directly inspired by Algorithm 4.2 of besides blocking, 
the scheme proposed here is different in that the matrix A is updated in a manner analogous 
to “modified” column-pivoted Gram-Schmidt. This updating allows the randomized stopping 
criterion employed in {6| to be replaced with a precise deterministic stopping criterion. 


4.1. Blocking. Converting the basic scheme in Figure [I] to a blocked scheme is in principle 
straight-forward. Suppose that in addition to an m x n matrix A and a rank £, we have set 
a block size b such that l = sb, for some integer s. Then draw an n x £ Gaussian random matrix 
and partition it into slices {S2j}j =1 , each of size n x 6, so that 

( 13 ) n= [ni,n 2 , 

We analogously partition the matrices Q and B in groups of b columns and b rows, respectively, 

Q = [Qi, Q 2 , • • ■, Q s ] and B = 

The blocked algorithm then proceeds to build the matrices {Q*K=i and {B.j}f =1 one at a time. 
We first initiate the algorithm by setting 

(14) A (0) = A. 


Bi 

B 2 

B, 
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Then step forwards, computing for i = 1, 2, ..., s the matrices 

(15) Qj = orth 

(16) Bj = 

(17) A w = A (i_1) - QjB t . 

We will next prove that the matrix Q, = [Qi, Q 2 , ..., Qi] constructed is indeed orthonormal, and 

that the matrix A (l ) defined by (17) is the “remainder” after i steps, in the sense that 


A« = A- [Qi, Q 2 , ..., Q i] [Qi, Q 2 , , Q* 

To be precise, we will prove the following proposition: 


A - Q f Q A. 


Proposition 4.1. Let A be an m x n matrix. Let b denote a block size, and let s denote the 
number of steps. Suppose that the rank of A is at least sb. Let S2 be a Gaussian random matrix 
of size n x sb, partitioned as in (13), with each £2j of size n x b. Let {A^}*-_ 0 , {Qj}* =1 , and 
{Bj}* =1 , be defined by (If) - (11). Set: 

(18) 

and 


p. = TQQ: 

3 = 1 


(19) Qi = [Qi, Q 2 ,-Qi] , B i= [BJ, B* 2 , ...,B* 

Then for every i = 1, 2,..., s, it is the case that: 


Y i — [A£2i, Aft 2 , • • •, A£1, 


(a) The matrix Qi is ON, so P, is an orthogonal projection. 

(b) A« = (I - Pi) A = (I - Q,Q*) A and B, = Q*A. 

(c) i?(Qi) = i2(Yi). 


Proof. The proof is by induction. We will several times use that if C is a matrix of size n x b of 
full rank, and we set Q = orth(C), then R( Q) = i?(C), where i?(X) denotes the range of X. We 
will also use the fact that if Q is a Gaussian random matrix of size n x £, and E is a matrix of size 
m x n with rank at least £, then the rank of E£2 is with probability 1 precisely £ [6]. 


Direct inspection of the definitions show that (a), (b), (c) are all true for i = 1. Suppose all 
statements are true for i — 1. We will prove that then (a), (b), (c) hold for i. 


To prove that (a) holds for i, we use that (b) holds for i 


1 and insert this into (15) to get 


( 20 ) 


Qi = orth((l - Pj-i)AJ2j). 


Then observe that P,_i is the orthogonal projection onto a space of dimension b{i — 1), which 
means that the matrix A^ -1 ^ = (I — Pj_i)A has rank at least bs — b(i — 1) = b(s — i + 1) > b. 
Consequently, A^ t_1, £2i has rank precisely b. This shows that 

R(Qi) c R (I - Pi_i) = i?(P i -i) ± = R([ Qi, Q 2 , ■ ■ •, Qi-i]) ± . 

It follows that Q*Q* = 0 whenever j < i which shows that Q,: is ON. Next, 

B, = Q*A (i_1 ) = Q* (I - Qi_iQ*_ x ) A^- 2 ) = Q*A^" 2 ) = • • ■ = Q*A^ = Q*A 


It follows that: 


Q*A = [Qr,..., Q,]*A 


Q*A 


Bi' 

Q* A 


Bi 
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Thus, (a) holds for i. 

Proving (b) is a simple calculation. Combining (16) and © we get 

A« = = (I —QjQ*)A^ -1 ^ = (l-Q i Q*)(l-P i _i)A = (l-(P i _ 1 + Q i Q?))A, 

where in the last step we used that Q*P,_i = 0. Since P; = Pi_i + Q,Q*, this proves (b). 


To prove (c), we observe that (20) implies that 

(21) R(Qi) CflQAni.Pi-iAni]). 

Induction assumption (c) tells us that 

(22) C R([ Qi, Q 2 , • • • , Q <_!]) = R([ Afii, A£2 2 , 
Combining ( |21| ) and (22), we find 

(23) i?(Qi) C R([Afii, Afi 2 , ••• , Afli_i, Afli]). 


Afli-i]). 


Equation (23) together with the induction assumption (c) imply that R([Qi, Q 2 , Qi]) C 
i?([AS2i, AS2 2 , • • • , A£2j]). But both of these spaces have dimension precisely bi , so the fact that 
one is a subset of the other implies that they must be identical. □ 


Let us next compare the blocked algorithm defined by relations (14) - © to the unblocked 
algorithm described in Figure [l] Let for a fixed Gaussian matrix the output of the blocked 
version be {Q B} and let the output of the unblocked method be {Q, B}. These two pairs of 
matrices do not need to be identical. (They depend on how exactly the QR factorizations are 
implemented, for instance). However, Proposition 4.1 demonstrates that the projectors QQ* and 
QQ* are identical. To be precise, both of these matrices represent the orthogonal projection onto 
the space R( AS2). This means that the error resulting from the two algorithms are also identical 


A - QQ*A 


A - QQ A 


error from blocked algorithm error from non blocked algorithm 


Consequently, all theoretical results given in (6j, cf. Section 3.2, directly apply to the output of the 
blocked algorithm too. 


4.2. Adaptive rank determination. The blocked algorithm defined by (14) - © was in Section 


4.1 presented for the case where the rank £ of the approximation is given in advance. A perhaps 


more common situation in practical applications is that a precision e > 0 is specified, and then 
we seek to compute an approximation of as low rank as possible that is accurate to precision 
e. Observe that in the algorithm defined by (14) - ©, we proved that after step i has been 
completed, we have 

||A«|| = || A — Pi A || = || A — [Qr Q 2 ••• Qi] [Qi Q 2 Q s ]* A||. 

In other words, A^ holds precisely the residual remaining after step i. This means that incorpo¬ 
rating adaptive rank determining is now trivial — we simply compute ||A^|| after completing step 
i, and break once ||A®|| < e. The algorithm resulting is shown as randQBJb in Figure [ij (The 


purpose of line (3’) will be explained in Section 4.3 


Remark 3. Recall that our default norm in this manuscript, the Frobenius norm, is simple to 
compute, which means that the check on whether to break the loop on line (1) in Figure hardly 
adds at all to the execution time. If circumstances warrant the use of a norm that is more expensive 
to compute, then some modification of the algorithm would be required. Suppose, for instance, that 
we seek an approximation in the spectral norm. We could then use the fact that the Frobenius 
norm is an upper bound on the spectral norm, keep the Frobenius norm as the breaking condition, 
















function [Q, B] = randQB_b(A, s, b) 



(1) 

for i = 1, 2, 3, ... 



(2) 

Qi = randn(n, b ) 



(3) 

Q ; : = orth(Afij) 


Cmm mnb + C qT mb 2 

(3’) 

Qi = orth: Q, - >:' QjQ*Qi) 


2 (i — 1 )C mm mb 2 + C qr mb 2 

(4) 

Bj = Q* A 


Cmmmnb 

(5) 

A = A - Q t B t 


Cmmmnb 

(6) 

if A < e then stop 



(7) 

end for 



(8) 

Set Q = [Qi • • • Qi] and B = [B* • • • 

B*]*- 



Figure 2. A blocked version of the randomized range finder, cf. Fig [TJ The al¬ 
gorithm takes as input an m x n matrix A, a block size b, and a tolerance e. Its 
output are factors Q and B such that || A - QB|| < e. Note that if the algorithm 
is executed in exact arithmetic, then line (3’) does nothing. Text in blue refers to 
computational cost, see Section 4.4 for notation. 


and then eliminate any “superfluous” degrees of freedom that were included in the post-processing 
step, cf. Section 3.3.1. (This approach would only be practicable for matrices whose singular values 


exhibit reasonable decay, as otherwise the discrepancy in the e-ranks would be prohibitively large.) 


4.3. Floating point arithmetic. When the algorithm defined by is carried out in 

finite precision arithmetic, a serious problem often arises in that round-off errors will accumulate 
and will cause loss of orthonormality among the columns in {Qi, Q 2 , The problem is that as 

the computation proceeds, the columns of each computed matrix Q, will due to round-off errors 
drift into the span of the columns of {Qi, Q 2 , • • •, Qi— i}- To fix this problem, we explicitly 
reproject Q, away from the span of the previously computed basis vectors [IJ. The line (3’) in 
Figure [2] represents the re-projection that is done to combat the accumulation of round-off errors. 
(Note that if the algorithm is carried out in exact arithmetic, then Q*Q, = 0 whenever j < i, so 
line (3’) would have no effect.) 


4.4. Comparison of execution times. Let us compare the computational cost of algorithms 
randQB (Figure [I]) and randQELb (Figure [2]). To this end, let C mm and C qi denote the scaling 
constants for the cost of executing a matrix-matrix multiplication and a full QR factorization, 
respectively. (While the algorithms only need the function orth, cf. Section 2.4, this cost is for 
practical purposes the same as the cost for QR factorization.) In other words, we assume that: 


• Multiplying two matrices of sizes m x n and n x r costs (7 mm mnr. 

• Performing a QR factorization of a matrix of size m x n, with m > n, costs C qv mn 2 . 


Note that these are rough estimates. Actual costs depend on the actual sizes (note that the 
costs are dominated by data movement rather than flops), but this model is still instructive. The 
execution time for the algorithm in Figure [l] is easily seen to be 


TrandQB 


2 Cmm mnl + C qr m£ 2 . 
9 


(24) 


rsj 






For the blocked algorithm of Figure [2j we assume that it stops after s steps and set i = sb. Then 


7randQB_b ~ E \3C mm mnb + 2 (i — 1 )C mm mb 2 + C qr 2?nb 2 ] 


2=1 


~ 3sC mm mnb + s 2 C mm mb 2 + sC qr 2mb 2 . 


Using that sb = £ we find 
(25) T rand q B j 


b ~ 3C mm rnn£ + C mm m£ 2 H— C qi m£ 2 . 


Comparing (24) and (25), we see that the blocked algorithm involves one additional term of 
C mm mn£ , but on the other hand spends less time executing full QR factorizations, as expected. 


Remark 4. All blocked algorithms that we present share the characteristic that they slightly in¬ 
crease the amount of time spent on matrix-matrix multiplication, while reducing the amount of time 
spent performing QR factorization. This is a good trade-off on many platforms, but becomes par¬ 
ticularly useful when the algorithm is executed on a GPU. These massively multicore processors are 
particularly efficient at performing matrix-matrix multiplications, but struggle with communication 
intensive tasks such as a QR factorization. 


5. A VERSION OF THE METHOD WITH ENHANCED ACCURACY 


5.1. Randomized sampling of a power of the matrix. The accuracy of the basic randomized 
approximation scheme described in Section [3j and the blocked version of Section [4] is well under¬ 
stood. The analysis of [6] (see the synopsis in Section 3.2) shows that the error ||A — A a 


'approx| 


depends strongly on the quantity ^^y=fc+i j • This implies that the scheme is highly accu¬ 
rate for matrices whose singular values decay rapidly, but less accurate when the “tail” singular 
values have substantial weight. The problem becomes particularly pronounced for large matrices. 
Happily, it was demonstrated in |9j that this problem can easily be resolved when given a matrix 
with slowly decaying singular values by simply applying a power of the matrix to be analyzed to 
the random matrix. To be precise, suppose that we are given an m x n matrix A, a target rank 
£, and a small integer P (say P = 1 or P = 2). Then the following formula will produce an ON 
matrix Q whose columns form an approximation to the range: 


£2 = randn(n, £), 


and 


Q = orth((AA*) P Aft,0). 


The key observation here is that the matrix (AA*V A has exactly the same left singular values as 
A, but its singular values are a 2P+l (observe that our objective is to build an ON-basis for the 
range of A, and the optimal such basis consists of the leading left singular vectors). Even a small 
value of P will typically provide enough decay that highly accurate results are attained. For a 
theoretical analysis, see (0 Sec. 10.4], 

When the “power scheme” idea is to be executed in floating point arithmetic, substantial loss of 
accuracy happens whenever the singular values of A have a large dynamic range. To be precise, 
if Cmach denotes the machine precision, then any singular components smaller than o\ e n (.) cli ' 
will be lost. This problem can be resolved by orthonormalizing the “sample matrix” between each 
application of A and A*. This results in the scheme we call randQEUp, as shown in Figure |3j (Note 
that this scheme is virtually identical to a classical subspace iteration with a random Gaussian 
matrix as the start 1101.') 
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function [Q, B] = randQB_p(A, l, P) 


(1) 

£2 = randn(n, £). 


(2) 

Q = orth(Afi). 

Cmm Tnn£ + C qr m£ 2 

(3) 

for j = 1 : P 


(4) 

Q = orth(A*Q). 

C mm mn£ + C qr ml 2 

(5) 

Q = orth(AQ). 

Cmm rural 4- C qr ml 2 

(6) 

end for 


(7) 

B = Q*A 

C m m rural 


Figure 3. An accuracy enhanced version of the basic randomized range finder in 
Figure [TJ The algorithm takes as input an m x ra matrix A, a rank l, and a “power” 
P (see Section 5.1). The output are matrices Q and B of sizes m x £ and i x ra 
such that A « QB. Higher P leads to better accuracy, but also higher cost. Setting 
P = 1 or P = 2 is often sufficient. 



function [Q, B] = randQB_pb(A, e, P, b ) 


(1) 

for i = 1,2,3,... 


(2) 

£L, = randn(ra, b). 


(3) 

Q i = orth(A£2, ; ). 

C mm mnb + C qr mb 2 

(4) 

for j = 1 : P 


(5) 

Q i = orth(A*Qj). 

C mm rnnb + C qr mb 2 

(6) 

Q { ; = orth(AQj). 

Cm m mnb + C qr mb 2 

(7) 

end for 


(8) 

Q, = orth(Q, - Y?j =! Q;Q)Q ) 

2{i — l)C mm rra6 2 + C qr rra6 2 

(9) 

B, = Q*A 

Cmmmnb 

(10) 

A = A - QjBj 

Cmmmnb 

(11) 

if A < e then stop 


(12) 

end while 


(13) 

Set Q = [Qi • • ■ QJ and B = [BJ ■ • • B*}*. 



Figure 4. A blocked and adaptive version of the accuracy enhanced algorithm 
shown in Figure [3j Its input and output are identical, except that we now provide 
a tolerance e as an input (instead of a rank), and also a block size b. 


5.2. The blocked version of the power scheme. A blocked version of randQELp is easily 
obtained by a process analogous to the one described in Section 4.1, resulting in the algorithm 
“randQELpb” in Figure [4j Line (8) combats the problem of incremental loss of orthonormality 


when the algorithm is executed in finite precision arithmetic, cf. Section 4.3 


5.3. Computational complexity. When comparing the computational cost of randQB_p (cf. Fig¬ 
ure [3]) versus randQELpb (cf. Figure [IJ , we use the notation that was introduced in Section 


4.4 By 


inspection, we directly find that 


F r andQB_p ~ C mm (2 + 2 P) mn£ + C qr (l + 2P)mP. 
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For the blocked scheme, inspection tells us that 

s 

^randQB_pb rs -' E [Cmm(3 + 2 P) mnb + 2 (i — 1 )C mm rnb 2 + C' qr (2 + 2 P)mb 2 ] . 
i=l 

Executing the sum, and utilizing that £ = sb, we get 

E r andQB_pb ~ Cmm(3 + 2P) mTll + C mm m£ 2 + — Cq r (2 + 2 P)mP. 

In other words, the blocked algorithm again spends slightly more time executing matrix-matrix 
multiplications, and quite a bit less time on qr-factorizations. This trade is often favorable, and 
particularly so when the algorithm is executed on a GPU (cf. Remark [4]). On the other hand, 
when £ <C n, the benefit to saving time on QR factorizations is minor. 

5.4. Is re-orthonormalizing truly necessary? Looking at algorithm randQELp, it is very tempt¬ 
ing to skip the intermediate QR factorizations and simply execute steps (2) - (6) as: 


(2) Y = AST 

(3) for j = 1 : P 

(4) Y = A(A*Y). 

(5) end for 

(6) Q = orth(Y) 


This simplification does speed things up substantially, but as we mentioned earlier, it can lead to 
loss of accuracy. In this section we state some conjectures about when re-orthonormalization is 
necessary. These conjectures appear to show that the blocked scheme is much more resilient to 
skipping re-orthonormalization. 


To describe the issue, let us fix a (small) integer P , and define the matrix 

A P = (AA*) P A. 

If the SVD of A is A = UZV*, then the SVD of Ap is 


A P = UZ 2P+1 V*. 

In computing A = ApS2, we lose all information about any singular mode i for which c 2P+1 < 
cr 2P+1 e mac h, where e mac h is the machine precision. In other words, in order to accurately resolve 
the first k singular modes, re-orthogonalization is needed if 


(26) 




1/(2P+1) 


mach 


As an example, with P = 2 and e mac h = 10 —15 , we find that u/acif^ = 10~ 3 ) so re-orthonormalization 
is imperative resolve any components smaller than a\ - ■ ■ 1CU 3 . Moreover, if we skip re-orthonormalization, 
we are likely to see an overall loss of accuracy affecting singular values and singular vectors asso¬ 
ciated with larger singular values. 


Next consider the blocked scheme. 


The crucial observation is that now, instead of trying to 
k 


extract the whole range of singular values {ay }| =1 (and their associated eigenvectors) at once, we 


now extract them in s groups of b modes each, where k 
get reasonable accuracy as long as 


sb. This means that we can expect to 


(27) 


flR-ljb+l 1/(2P+1) 

_ — Lnach ‘ 

°ib 


for i = 1, 2, 


Comparing (26) and (27), we see that (27) is a much milder condition, in particular when the 
block size b is much smaller than k. 
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All claims in this section are heuristics. However, while they have not been rigorously proven, 


they are supported by extensive numerical experiments, see Section 6.3 


6. Numerical experiments 


In this section, we present numerical examples that illustrate the computational efficiency and the 
accuracy of the proposed scheme, see Sections |6.1| and \6.2\ respectively. The codes we used are 


available at http://amath.colorado.edu/faculty/martinss/main_codes.html and we encour¬ 
age any interested reader to try the methods out, and explore different parameter sets than those 
included here. 


6.1. Comparison of execution speeds. We first compare the run times of different techniques 
for computing a partial (rank k) QR factorization of a given matrix A of size n x n. Observe that 
the choice of matrix is immaterial for a run time comparison (we investigate accuracy in Section 


6.2). We compared three sets of techniques: 


• Truncating a full QR factorization, computed using the Intel MKL libraries. 

• Taking k steps of a column pivoted Gram-Schmidt process. The implementation was 
accelerated by using MKL library functions whenever practicable. 

• The blocked “QB” scheme, followed by postprocessing of the factors to obtain a QR factor¬ 
ization. We used the “power method” described in Section [5] with parameters P = 0,1,2. 


The algorithms were all implemented in C and run on a desktop with a 6-core Intel Xeon E5-1660 
CPU (3.30 GHZ), and 128GB of RAM. We also ran the blocked “QB” scheme on an NVIDIA 
Tesla K40c GPU installed on the same machine, using the Matlab GPU computing interface. The 
results are shown in Figure [5] Figure [6] shows the dependence of the runtime on the block size. 

Figure [5] shows that our blocked algorithms (blue, magenta, and cyan lines) compare favorably 
to both of the two benchmarks we chose — full QR using MKL libraries (green) and partial 
factorization using column pivoting (green). However, it must be noted that our implementation 
of column pivoted QR is far slower than the built-in QR factorization in the MKL libraries. Even 
for as low of a rank as k = 100, we do not break even with a full factorization until n = 8 000. 
This implies that column pivoting can be implemented far more efficiently than we were able to. 
The point is that in order to attain the efficiency of the MKL libraries, very careful coding that is 
customized to any particular computing platform would have to be done. In contrast, our blocked 
code is able to exploit the very high efficiency of the MKL libraries with minimal effort. 

Finally, it is worth nothing how particularly efficient our blocked algorithms are when executed on 
a GPU. We gain a substantial integer factor speed-up over CPU speed in every test we conducted. 


6.2. Accuracy of the randomized scheme. We next investigate the accuracy of the randomized 
schemes versus column-pivoted QR on the one hand (easy to compute, not optimal) and versus the 
truncated SVD on the other (expensive to compute, optimal). We used 5 classes of test matrices 
that each have different characteristics: 

Matrix 1 (fast decay): Let Ai denote an m x n matrix of the form Ai = UDV* where U 
and V are randomly drawn matrices with orthonormal columns (obtained by performing 
qr on a random Gaussian matrix), and where D is a diagonal matrix with entries roughly 
given by dj = gj f3' J ~ l where gj is a random number drawn from a uniform distribution on 
[0,1] and j3 = 0.65. To precision 10~ 15 , the rank of Ai is about 75. 

Matrix 2 (slow decay): The matrix A 2 is formed just like Ai, but now the diagonal entries 
of D decay very slowly, with dj = (1 + 200 (j — 1)) 1//2 . 
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10 4 


n 

Figure 5. Timing results for different algorithms on CPU and GPU. The integer 
P denotes the parameter in the “power scheme” described in Section [5j 


Time for randQBblocked (k=100) for different block sizes Time for randQBblocked (k=2000) for different block sizes 



Figure 6 . Timing results for blocked QB scheme for different block sizes b. The 
integer P denotes the parameter in the “power scheme” described in Section [5] 

Matrix 3 (sparse): The matrix A 3 is a sparse matrix given by A 3 = £^=1 7 x / y)+Xu=ii TM ^ 7 x j y) 
where Xj and y ■ are random sparse vectors generated by the Matlab commands sprand(m, 1, 0.01) 
and sprand(n, 1, 0.01), respectively. We used m = 800 and n = 600 with resulted in a ma¬ 
trix with roughly 6 % non-zero elements. This matrix was borrowed from Sorensen and 
Ernbree im and is an example of a matrix for which column pivoted Gram-Schmidt per¬ 
forms particularly well. 

Matrix 4 (Kahan): This is a variation of the “Kahan counter-example” which is a matrix 
designed so that Gram-Schmidt performs particularly poorly. The matrix here is formed 
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via the matrix matrix product SK where: 


1 

0 

0 

0 

r—1 


1 — </> —(f> —cf) ■ ■ ■ 

0 c 0 0 ••• 


0 1 -0 -cj) ■■■ 

0 0C 2 0 ••• 

and K = 

0 0 1 -(/)■■■ 

00 0 c 3 ••• 


00 0 1 ••• 





with random £, cj) > 0, ( 2 + (j> 2 = 1. Then SK is upper triangular, and for many choices of £ 
and 4>, classical column pivoting will yield poor performance as the different column norms 
will be similar and pivoting will generally fail. The rank-A; approximation resulting from 
column pivoted QR is substantially less accurate than the optimal rank-A; approximation 
resulting from truncating the full SVD [5j- However, we obtain much better results than 
QR with the QB algorithm. 

Matrix 5 (S shaped decay): The matrix A5 is built in the same manner as Ai and A2, 
but now the diagonal entries of D are chosen to first hover around 1, then decay rapidly, 


and then level out at a relatively high plateau, cf. Figure 11 


We compare four different techniques for computing a rank-A; approximation to our test matrices: 


SVD: We computed the full SVD (using the Matlab command svd) and then truncated to 
the first k components. 

Column-pivoted QR: We implemented this using modified Gram-Schmidt with reorthog- 
onalization to ensure that orthonormality is strictly maintained in the columns of Q. 
r andQ B — single vector: This is the greedy algorithm labeled “Algorithm 1” in Section 
1.2l implemented with q y on line (4) chosen as q ; = y/||y|| where y = (AA*) Aw and 
where u> is a random Gaussian vector. 
randQB — blocked: This is the algorithm randQB_pb shown in Figure |4j 

The results are shown in Figure [T] - [llj We make three observations: (1) When the “power 
method” described in Section [5] is used, the accuracy of randQB_pb exceeds that of column-pivoted 
QR in every example we tried, even for as low of a power as P = 1. (2) Blocking appears to lead 
to no loss of accuracy. In most cases, there is no discernible difference in accuracy between the 
blocked and the non-blocked versions. (3) The accuracy of randQB.pb is particularly good when 
errors are measured in the Frobenius norm. In almost all cases we investigated, essentially optimal 
results are obtained even for P = 1. 

Figures [ 7 ] 1 11 1 report the errors resulting from a single instantiation of the randomized algorithm. 
Appendix [A| provides more details on the statistical distribution of errors. 
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Frobenius norm 



k 


Figure 7. Errors for the 800 x 600 “Matrix 1” whose singular values decay very 
rapidly. The block size is b = 10. 



Frobenius norm 



Figure 8. Errors for the 800 x 600 “Matrix 2” whose singular values decay slowly. 
The block size is b = 10. 
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Spectral norm 


Frobenius norm 




Figure 9. Errors for the 800 x 600 “Matrix 3.” This is a sparse matrix for which 
column-pivoted Gram-Schmidt performs exceptionally well. However, randQB still 
gives better accuracy whenever a power P > 1 is used. 


Spectral norm 




0 20 40 60 80 100 

k 


Figure 10. Errors for the 1000 x 1000 “Matrix 4.” This matrix is a variation of 
the “Kahan counter-example” and is designed specifically to give poor performance 
for column pivoted QR. Here b = 20. 
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Spectral norm 


Frobenius norm 




Figure 11. Errors for the 800 x 600 “Matrix 5 ,: 
“S-shaped” decay. Here b = 15. 


whose singular values show an 


6.3. When re-orthonormalization is required. We claimed in Section 5.4 that the blocked 


scheme is more robust towards loss of orthonormality than the non-blocked scheme presented in 
[ 6 ]. To test this hypothesis, we tested what happens if we skip the re-orthonormalization between 
applications of A and A* in the algorithms shown in Figures [3] and [4} The results are shown 
in Figure [12J The key observation here is that the blocked versions of randQB still always yield 
excellent precision. When the block size is large, the convergence is slowed down a bit compared 
to the more meticulous implementation, but essentially optimal accuracy is nevertheless obtained 
relatively quickly. 


Remark 5. The numerical results in Figure 12 substantiate the claim that for the unblocked 
version, the best accuracy attainable is o\ +1 ^. In all examples, we have cj\ = 1, so the 
prediction is that for P = 1 the maximum precision is (10~ 15 ) = 10”' 5 and for P = 2 it is 

(10- 15 ) 1/5 = 10 -3 . The results shown precisely follow this pattern. Observe that for A 2 , no loss 
of accuracy is seen at all since the singular values we are interested in level out at about 10 -2 . 
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Errors for "Matrix 5" with P=1 (skipping qr) 



Errors incurred when not re-orthonormalizing between applications of 

The non-blocked scheme 


Figure 12 

A and A* in the “power method,” cf. Sections |5.4| and 6.3 


(red) performs precisely as predicted, and cannot resolve anything beyond precision 
ICE 5 when P = 1 and ICE 3 when P = 2. The blocked version converges slightly 
slower when skipping re-orthonormalization but always reaches full precision. 


7. Concluding remarks 

We have described a randomized algorithm for the low rank approximation of matrices. The 
algorithm is based on the randomized sampling paradigm described in |7[ [9, 6j [8]. In this article, 
we introduce a blocking technique which allows us to incorporate adaptive rank determination 
without sacrificing computational efficiency, and an updating technique that allows us to replace 
the randomized stopping criterion proposed in [6] with a deterministic one. Through theoretical 
analysis and numerical examples, we demonstrate that while the blocked scheme is mathematically 
equivalent to the non-blocked scheme of EIEIEHH] when executed in exact arithmetic, the blocked 
scheme is slightly more robust towards accumulation of round-off errors. 
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The updating strategy that we propose is directly inspired by a classical scheme for computing a 
partial QR factorization via the column pivoted Gram-Schmidt process. We demonstrate that the 
randomized version that we propose is more computationally efficient than this classical scheme 
(since it is hard to block the column pivoting scheme). Our numerical experiments indicate that 
the randomized version not only improves speed, but also leads to higher accuracy. In fact, in all 
examples we present, the errors resulting from the blocked randomized scheme are very close to 
the optimal error obtained by truncating a full singular value decomposition. In particular, when 
errors are measured in the Frobenius norm, there is almost no loss of accuracy at all compared to 
the optimal factorization, even for matrices whose singular values decay slowly. 

The scheme described can output any of the standard low-rank factorizations of matrices such as, 
e.g., a partial QR or SVD factorization. It can also with ease produce less standard factorizations 
such as the “CUR” and “interpolative decompositions (ID)”, cf. Section 
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Appendix A. Distribution of errors 


The output of our randomized blocked approximation algorithms is a random variable, since it 
depends on the drawing of a Gaussian matrix £2. It has been proven (see, e.g., ED that due to 
concentration of mass, the variation in this random variable is tiny. The output is for practical 
purposes always very close to the expectation of the output. For this reason, when we compared 


the accuracy of the randomized method to classical methods in Section 6.2, we simply presented 
the results from one particular draw of £2. In this section, we provide some more detailed numerical 
experiments that illuminate exactly how little variation there is in the output of the algorithm. 
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We present the results from all matrices considered in Section 6.2 except Matrix 4 (the so called 
“Kahan counter example”) since this is an artificial example concocted specifically to give poor 
results for column-pivoted Gram Schmidt. 


Figures [13] through |20| provide more information about the statistics of the outcome for the exper¬ 
iments reported for a single instantiation in Figures [7| through |11[ For each experiment, we show 
both the empirical expectation of the accuracy, and the error paths from 25 different instantiations. 
We observe that the errors are in all cases tightly clustered, in particular for P = 1 and P = 2. We 
also observe that when the singular values decay slowly, the clustering is stronger in the Frobenius 
norm than in the spectral norm. 


In our final set of experiments, we increased the number of experiments from 25 to 2 000. To keep 
the plots legible, we plot the errors only for a fixed value of k, see Figure [2Tj These experiments 
further substantiate our claim that the results are tightly clustered, in particular when the “power 
method” is used, and when the Frobenius norm is used. 
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Figure 13. The empirical mean errors from 25 instantiations of the randomized 
factorization algorithm applied to Matrix 1. 




Figure 14. The error paths for 25 instantiations of the randomized factorization 
algorithm applied to Matrix 1. 
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Figure 15. The empirical mean errors from 25 instantiations of the randomized 
factorization algorithm applied to Matrix 2. 




Figure 16. The error paths for 25 instantiations of the randomized factorization 
algorithm applied to Matrix 2. 
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Spectral norm - average errors over 25 experiments Frobenius norm - average errors over 25 experiments 




k k 

Figure 17. The empirical mean errors from 25 instantiations of the randomized 
factorization algorithm applied to Matrix 3. 

Spectral norm - errors for 25 experiments Frobenius norm - errors for 25 experiments 




Figure 18. The error paths for 25 instantiations of the randomized factorization 
algorithm applied to Matrix 3. 
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Spectral norm - average errors over 25 experiments Frobenius norm - average errors over 25 experiments 
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Figure 19. The empirical mean errors from 25 instantiations of the randomized 
factorization algorithm applied to Matrix 5. 

Spectral norm - errors for 25 experiments Frobenius norm - errors for 25 experiments 




Figure 20. The error paths for 25 instantiations of the randomized factorization 
algorithm applied to Matrix 5. 
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Figure 21. Each blue cross in the graphs represents one instantiation of the ran¬ 
domized blocked algorithm. The x- and y-coordinates show the relative errors in 
the spectral and Frobenius norms, respectively. For reference, we also include the 
error from classical column-pivoted Gram-Schmidt (the magenta diamond), and 
the error incurred by the truncated SVD. The dashed lines are the horizonal and 
vertical lines cutting through the point representing the SVD — since these errors 
are minimal, every other dot must be located above and to the right of these lines. 
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